Add transaction logs and basic conflict resolution #403
Conversation
ef8ebfb → cb287a1 (compare)
Fix a couple of minor bugs too
Also, better performance by avoiding a big clone of the ChangeSet and more tests.
cb287a1 → c1178a7 (compare)
Seba this looks amazing. I gave this what I would call a "shallow" review. Overall the approach makes sense, the data structures look right, tests are legible, etc.
The following observation is in no way a blocker for moving forward with this PR...
I wonder about the overall design of the transaction logs. Right now the logs are useful for detecting conflicts only, but they don't have the full information from the changeset. I'm curious whether it would make sense to just serialize the full changeset as the transaction log. This is what lance does: https://lancedb.github.io/lance/format.html
The advantage of this is that you could potentially recover an entire session if a commit fails. (Currently this information just lives in memory.) The downside of course is that the changeset is much larger and more expensive to store / parse.
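To make the trade-off concrete, here is a minimal sketch of the two shapes being discussed. All names and field layouts are hypothetical simplifications, not Icechunk's real types: the changeset carries payloads (so serializing it could recover a session), while the transaction log is a projection down to ids only.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical simplified shapes; the real Icechunk types differ.
type NodeId = u64;
type ChunkIndices = Vec<u32>;

/// Full session state: carries payloads, so persisting it could replay a session.
pub struct ChangeSet {
    pub new_groups: HashMap<NodeId, String>, // id -> path
    pub updated_chunks: HashMap<NodeId, HashMap<ChunkIndices, Vec<u8>>>, // chunk payloads
}

/// Conflict-detection summary: ids only, no payloads, so much smaller on disk.
pub struct TransactionLog {
    pub new_groups: HashSet<NodeId>,
    pub updated_chunks: HashMap<NodeId, HashSet<ChunkIndices>>,
}

/// Project a changeset down to just the ids needed for conflict detection.
pub fn to_transaction_log(cs: &ChangeSet) -> TransactionLog {
    TransactionLog {
        new_groups: cs.new_groups.keys().copied().collect(),
        updated_chunks: cs
            .updated_chunks
            .iter()
            .map(|(node, chunks)| (*node, chunks.keys().cloned().collect()))
            .collect(),
    }
}
```

The projection is lossy by construction: you can detect overlapping writes from the log, but only the full changeset could recover the written bytes after a failed commit.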
#[allow(clippy::panic)]
VersionSelection::Fail => panic!("Bug in conflict resolution: ChunkDoubleUpdate flagged as unrecoverable")
Are you saying we will never actually hit this code path in normal usage?
Yes, this possibility is filtered out above, but the compiler doesn't know it.
for snap_id in new_commits.into_iter().rev() {
    let tx_log = self.storage.fetch_transaction_log(&snap_id).await?;
One thought about this loop which is probably a very premature optimization...
Rather than fetching each commit one at a time and computing the conflicts one at a time, couldn't we fetch all of the transaction logs (in one async gather), merge them together, and compute conflicts once?
This optimization could be important if there are lots of commits.
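For illustration, a minimal sketch of the "fetch everything at once, process once" shape being suggested. This is not Icechunk's API: `fetch_tx_log` is a hypothetical synchronous stand-in for `storage.fetch_transaction_log(&snap_id).await`, and threads stand in for an async gather such as `futures::future::try_join_all`.

```rust
use std::thread;

// Hypothetical stand-in for `storage.fetch_transaction_log(&snap_id).await`;
// here a "log" is just a vector of fake entries derived from the snapshot id.
fn fetch_tx_log(snap_id: u32) -> Vec<u32> {
    vec![snap_id, snap_id + 100]
}

/// Fetch every log concurrently (threads stand in for an async gather),
/// returning them in input order so the caller can merge them once
/// instead of fetching and checking one commit at a time.
fn fetch_all(snap_ids: &[u32]) -> Vec<Vec<u32>> {
    let handles: Vec<_> = snap_ids
        .iter()
        .map(|&id| thread::spawn(move || fetch_tx_log(id)))
        .collect();
    // Joining in spawn order preserves the commit ordering.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

The latency win comes from overlapping the fetches; the open question (discussed below) is whether the merge step itself is sound.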
Great question. That's what I had initially, and it's what sounded more intuitive to me too. Here is why I gave up on it for this first pass; we can always go back to it eventually:
- The logic of merging the transaction logs is not trivial. In fact, it looks a lot like the conflict resolution logic. Example: one commit deletes an array, and the next commit creates a group on the same path, edits its attributes, etc. It gets very messy, and it will be pretty crazy once we introduce
rename
- In the current design I was very careful to ensure that if you are only able to rebase half of the commits, you get back a repo rebased to that point; you can then change your policy (or make some manual changes) and continue with another rebase. If we merge all the tx logs up front, it's all or nothing.
- Memory usage is much larger if you have to merge many tx logs.
- Error reporting becomes intractable under tx log merge.
I should have written a design document with all these details, but I was too bored of this problem already hehe.
As I mentioned somewhere, I don't feel I got to the best design for this problem. I'm pretty sure we'll have to revisit soon.
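The first bullet can be made concrete with a toy example, assuming hypothetical per-commit log sets keyed by path (the real logs are keyed by `NodeId`). A naive set-union merge discards commit ordering, so "commit 1 deletes `/foo`, commit 2 recreates `/foo`" becomes a log where `/foo` is simultaneously deleted and created:

```rust
use std::collections::HashSet;

// Hypothetical per-commit log, keyed by path for illustration.
#[derive(Default)]
struct TxLog {
    new_groups: HashSet<String>,     // paths created in this commit
    deleted_arrays: HashSet<String>, // paths deleted in this commit
}

/// Naive merge: union the per-commit sets, discarding commit order.
fn naive_merge(logs: &[TxLog]) -> TxLog {
    let mut merged = TxLog::default();
    for log in logs {
        merged.new_groups.extend(log.new_groups.iter().cloned());
        merged.deleted_arrays.extend(log.deleted_arrays.iter().cloned());
    }
    merged
}
```

In the merged log, `/foo` is in both sets, so a conflict checker can no longer tell whether the path currently exists; a correct merge would need ordering-aware logic, which is essentially the conflict-resolution logic again.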
@@ -725,6 +734,63 @@ impl Repository {
    }
}

pub async fn rebase(
A comment about what exactly this important function does would be very helpful.
My interpretation is that, if successful, it mutates the state of the repo such that it can commit successfully to the specified branch.
Yes, your interpretation is correct. I'll document, and include the tricky case of "partial success" which is important.
done
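The "partial success" contract described above can be sketched in miniature. This is not Icechunk's real signature: `CommitId`, `try_apply`, and the `Result<usize, usize>` return are hypothetical stand-ins for applying one divergent commit's transaction log at a time.

```rust
use std::collections::HashSet;

type CommitId = u32;

// Hypothetical: stands in for solving conflicts against one commit's tx log.
fn try_apply(commit: CommitId, unsolvable: &HashSet<CommitId>) -> bool {
    !unsolvable.contains(&commit)
}

/// Sketch of the contract: apply divergent commits one at a time; on an
/// unsolvable conflict, stop, leaving the session rebased up to the last
/// commit that succeeded. Returns Ok(n) on full success, Err(n) if it
/// stopped after successfully applying n commits.
fn rebase(commits: &[CommitId], unsolvable: &HashSet<CommitId>) -> Result<usize, usize> {
    let mut applied = 0;
    for &c in commits {
        if try_apply(c, unsolvable) {
            applied += 1;
        } else {
            return Err(applied); // partial success: usable state up to here
        }
    }
    Ok(applied)
}
```

On `Err`, the caller can change the conflict-solving policy (or make manual changes) and call rebase again on the remaining commits, which is exactly the incremental property argued for earlier in the thread.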
pub async fn rebase(
    &mut self,
    solver: &dyn ConflictSolver,
    update_branch_name: &str,
The default should be to update against the repo's current active branch. I can see always having to specify this explicitly as a potential pain point for users. 95% of the time there will just be a single branch active in a repo.
Edit: I guess this is just a general feature of the Rust API, so probably not a big deal.
Yeah, this is the low level API, and there is no notion of current branch at this level, but we will definitely make it the default (or only option) in the python bindings.
    repo2.rebase(&ConflictDetector, "main").await,
);
Ok(())
}
These tests are very clear and easy to follow.
repo2
    .set_user_attributes(
        path.clone(),
        Some(UserAttributes::try_new(br#"{"foo":"bar"}"#).unwrap()),
Just noting that this conflict should actually be resolvable since the changes are the same! 😆 But I understand of course that sort of resolution is not implemented yet.
Yes!!! This is a more advanced feature that I want us to have eventually. I wrote the test this way on purpose, so it fails once we implement that resolution.
}

#[tokio::test()]
async fn test_rebase_without_fast_forward() -> Result<(), Box<dyn Error>> {
It was hard for me to understand what this is testing just by reading the code. Consider adding some comments to this and the next test.
great, I'll add a brief comment to every test
done
    Ok(())
}
General comment: coming from python, pytest, and the wonderful world of test parametrization, these tests feel very verbose and somewhat repetitive. Not sure if there is any remedy for that.
yeah, it's painful ... people tend to use macros. But in general, our tests are bad: too verbose, no helper functions. I want us to refactor all of them into something much more readable and easier to write. We have #380 to work on it.
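For readers coming from pytest: the macro approach mentioned above can be sketched like this. The macro, the `double` function, and the case names are all hypothetical; a real version (or the `rstest` crate) would emit `#[test]` functions, whereas here the cases return `bool` so the sketch runs outside a test harness.

```rust
// Sketch of macro-based test parametrization: one body, many named cases.
// A real version would emit `#[test] fn $name() { ... }` items.
macro_rules! param_cases {
    ($($name:ident: $input:expr => $expected:expr,)*) => {
        $(fn $name() -> bool { double($input) == $expected })*
    };
}

// Trivial function under test.
fn double(x: i32) -> i32 {
    x * 2
}

// Each line expands into its own case function.
param_cases! {
    doubles_two: 2 => 4,
    doubles_negative: -3 => -6,
}
```

This buys the "add one line per case" ergonomics of `pytest.mark.parametrize`, at the cost of macro machinery.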
pub deleted_arrays: HashSet<NodeId>,
pub updated_user_attributes: HashSet<NodeId>,
pub updated_zarr_metadata: HashSet<NodeId>,
pub updated_chunks: HashMap<NodeId, HashSet<ChunkIndices>>,
Does updated chunks include deleted chunks?
More generally, how do we handle deleted chunks? Do we treat them the same as any other chunk write?
Yes, it's included there. For conflict detection today we treat a chunk delete like any other chunk write, which is not great: if both commits delete the same chunk, that shouldn't be a conflict. But it's also easy to fix with the default UseOurs policy. We can revisit later unless you have an important use case I haven't thought about.
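A toy sketch of why a double delete gets flagged today, assuming hypothetical simplified types: writes and deletes both land in `updated_chunks`, so detection is a plain intersection of touched indices, with no way to tell "both deleted it" apart from "both wrote different values".

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical simplified types; the real Icechunk types differ.
type NodeId = u64;
type ChunkIndices = Vec<u32>;

/// Flag every chunk index touched by both sessions. Because deletes are
/// recorded in the same map as writes, two sessions that each deleted the
/// same chunk intersect here and are reported as conflicting.
fn chunk_conflicts(
    ours: &HashMap<NodeId, HashSet<ChunkIndices>>,
    theirs: &HashMap<NodeId, HashSet<ChunkIndices>>,
) -> Vec<(NodeId, ChunkIndices)> {
    let mut out = Vec::new();
    for (node, chunks) in ours {
        if let Some(other) = theirs.get(node) {
            for idx in chunks.intersection(other) {
                out.push((*node, idx.clone()));
            }
        }
    }
    out
}
```

Distinguishing delete/delete (trivially mergeable) from write/write would require the log to record the kind of chunk change, not just the touched index.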
Co-authored-by: Ryan Abernathey <[email protected]>
It's definitely a valid point in the design space. This was my reasoning for choosing the current style:
I wish I had written a design doc...
This is impressive, and the tests are really easy to follow along with.
pub struct TransactionLog {
    // FIXME: better, more stable on-disk format
    pub icechunk_transaction_log_format_version: IcechunkFormatVersion,
    pub new_groups: HashSet<NodeId>,
So I understand correctly: we keep NodeId for the changes, which the Repository can then (in the future) resolve into a more readable or custom format by looking up the zarr path from the ID in the given manifest, right? Does the NodeId refer to the node's id in the incoming changeset or the existing changeset?
Yes, that is correct. NodeIds are slightly better than paths, because they are random, so something like create /foo -> delete /foo -> create /foo doesn't get tricky.
This is the NodeId assigned in the change that generated this transaction, so in your description, the existing changeset.
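The create/delete/create point can be demonstrated with a toy replay. Everything here is hypothetical: a counter stands in for Icechunk's random NodeIds, and the map stands in for the live node table.

```rust
use std::collections::HashMap;

// Hypothetical: a counter stands in for Icechunk's random NodeIds.
fn fresh_id(next: &mut u64) -> u64 {
    *next += 1;
    *next
}

/// Replays `create /foo -> delete /foo -> create /foo` and returns the two
/// ids involved. Keyed by path, the two arrays would collapse into a single
/// ambiguous log entry; keyed by id, they remain distinct nodes.
fn replay() -> (u64, u64) {
    let mut next = 0;
    let mut live: HashMap<&str, u64> = HashMap::new();
    let first = fresh_id(&mut next);
    live.insert("/foo", first); // create /foo
    live.remove("/foo"); // delete /foo
    let second = fresh_id(&mut next);
    live.insert("/foo", second); // create /foo again: a brand-new node
    (first, second)
}
```

Because the recreated array gets a fresh id, a transaction log entry mentioning `first` can never be confused with a change to the later node `second`, even though both lived at `/foo`.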
f75c7f1 → 94e9c64 (compare)